This report is organized as follows:
## sysname release
## "Linux" "3.10.0-1160.24.1.el7.x86_64"
## version machine
## "#1 SMP Thu Mar 25 21:21:56 UTC 2021" "x86_64"
options(na.action = "na.fail")
options(dplyr.summarise.inform = FALSE)
knitr::opts_chunk$set(cache.path = "cache/",
warning = FALSE)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
## corrplot 0.84 loaded
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
## Loading required package: lattice
## extract trip zip contents to new dirs train & test
unzip(paste0(getwd(), "/trip_data_train.zip"), junkpaths = T, exdir = "train")
unzip(paste0(getwd(), "/trip_data_test.zip"), junkpaths = T, exdir = "test")
## chr [1:1000] "0000.csv" "0001.csv" "0002.csv" "0003.csv" "0004.csv" ...
## chr [1:876] "1000.csv" "1001.csv" "1002.csv" "1003.csv" "1004.csv" ...
We start by loading the training set and inspecting the contents.
## Rows: 1,000
## Columns: 16
## $ filename <chr> "0000.csv", "0001.csv", "0002.csv", "0003.csv", "0004.csv",…
## $ feature1 <chr> "False", "False", "False", "False", "False", "False", "Fals…
## $ feature2 <chr> "False", "False", "False", "False", "False", "False", "Fals…
## $ feature3 <chr> "True", "False", "True", "True", "False", "False", "True", …
## $ feature4 <dbl> 5.209096, 4.450941, 5.396552, 4.970163, 5.266868, 5.423508,…
## $ feature5 <dbl> 9789.262, 10552.522, 10233.433, 10829.057, 10678.704, 8884.…
## $ feature6 <dbl> 30753.87, 33151.73, 32149.28, 34020.49, 33548.14, 27911.68,…
## $ feature7 <dbl> 0.0010104224, 0.0009997680, 0.0010145655, 0.0009929965, 0.0…
## $ feature8 <int> 5, 3, 6, 4, 6, 4, 6, 2, 7, 6, 5, 4, 6, 5, 7, 8, 7, 3, 8, 8,…
## $ feature9 <int> 13, 11, 13, 8, 11, 9, 11, 8, 15, 12, 5, 7, 9, 10, 8, 8, 12,…
## $ feature10 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ feature11 <dbl> 9.373984e+03, 4.251619e+04, 1.305321e+07, 1.131908e+03, 3.1…
## $ feature12 <dbl> 3.179204e-01, 2.229321e+00, 3.425951e+01, 2.576871e+01, 1.4…
## $ feature13 <dbl> 9.379193e+03, 4.252064e+04, 1.305322e+07, 1.136878e+03, 3.1…
## $ feature14 <dbl> 4.974085, 3.151531, 6.236594, 3.968008, 5.999782, 3.974651,…
## $ y <int> 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1,…
We will convert feature1:feature3 to the factor class for modeling. We may also want y as a factor, depending on the libraries we use.
train_raw %<>%
mutate(feature1 = feature1 %>% as.factor(),
feature2 = feature2 %>% as.factor(),
feature3 = feature3 %>% as.factor(),
y_fctr = paste0("Class_", y) %>% as.factor ) %>%
select(-y)
We plot the correlation matrix in order to understand the linear relationships between the features.
train_raw %>%
select_if(is.numeric) %>% ## exclude response and boolean types
as.matrix() %>%
cor() %>%
corrplot.mixed(lower.col = "black", number.cex = .5, tl.cex = .6)
Let's also check the 3 binary factors.
## , , feature2 = False
##
## feature3
## feature1 False True
## False 439 459
## True 54 48
We make the following observations about the dataset:
- feature11 and feature13 are co-linear
- feature5 and feature6 are co-linear
- feature8 and feature14 are co-linear
- feature10 and feature2 are both zero-variance (all observations are "False" for feature2)

As a result, we will remove 5 features total (feature10, feature2, and one from each co-linear pair) without any loss of information.
train_raw %<>% select(-c(feature10, feature2, feature11, feature5, feature8))
train_raw %>%
select(-c(filename, feature1, feature3, y_fctr)) %>% ## exclude response and boolean types
as.matrix() %>%
cor() %>%
corrplot.mixed(lower.col = "black", number.cex = .5, tl.cex = .6)
Let's also look at some distribution statistics for each feature. This may inform modeling decisions later on - method selection and feature transformations, for example.
## feature1 feature3 feature4 feature6 feature7
## False:898 False:493 Min. :2.116 Min. :22599 Min. :0.0009710
## True :102 True :507 1st Qu.:4.324 1st Qu.:29264 1st Qu.:0.0009934
## Median :4.991 Median :31294 Median :0.0010007
## Mean :4.985 Mean :31346 Mean :0.0010004
## 3rd Qu.:5.675 3rd Qu.:33548 3rd Qu.:0.0010069
## Max. :8.819 Max. :40275 Max. :0.0010345
## feature9 feature12 feature13 feature14
## Min. : 2.00 Min. : 0.0 Min. :4.000e+00 Min. : 0.8254
## 1st Qu.: 8.00 1st Qu.: 0.1 1st Qu.:8.010e+02 1st Qu.: 3.9806
## Median :10.00 Median : 2.7 Median :2.295e+04 Median : 4.9881
## Mean :10.05 Mean : 8845.5 Mean :8.861e+08 Mean : 4.9762
## 3rd Qu.:12.00 3rd Qu.: 78.3 3rd Qu.:6.395e+05 3rd Qu.: 6.0227
## Max. :22.00 Max. :1572134.4 Max. :6.070e+11 Max. :10.3276
## y_fctr
## Class_0:690
## Class_1:310
##
##
##
##
Most features are ‘well-behaved’. For example feature4 is reasonably symmetric without extreme outliers:
feature12 and feature13 stand out as being heavily right-skewed. See for example the density for feature12:
We may consider transforming these features later (for example, taking logs) depending on the modeling approach (some algorithms being invariant to monotonic transforms).
We also note that our response variable is binary with imbalanced classes.
We will extract new features from the trip data files. Let’s start by getting a feel for a single trip time series.
Trips have a time for each reading which tends to be taken at a 1-second frequency, with possible exception at the start of the trip; they also have a speed reading (meters per second) and a heading (clockwise angle against magnetic north), although the nominal heading may not be as important as changes in heading (turning).
Noticing there may be missing data at the beginning and end of a trip, we will impute missing values via NOCB (next observation carried backward) at the start and LOCF (last observation carried forward) at the tail. This is not necessarily a strong approach for missing data in the middle of a series, though it has the virtue that it will not fail with an error.
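A minimal base-R sketch of this head/tail imputation, using a small made-up vector (the actual pipeline uses imputeTS::na_locf()):

```r
## Hypothetical example vector with NAs only at the head and tail
x <- c(NA, NA, 3.1, 2.8, 4.0, NA)

first_obs <- min(which(!is.na(x)))  ## first non-missing index
last_obs  <- max(which(!is.na(x)))  ## last non-missing index

## NOCB for leading NAs, LOCF for trailing NAs
x[seq_len(first_obs - 1)] <- x[first_obs]
x[seq(last_obs + 1, length.out = length(x) - last_obs)] <- x[last_obs]

x  ## 3.1 3.1 3.1 2.8 4.0 4.0
```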
Let’s plot the trip series
gridExtra::grid.arrange(
## speed (meters/sec) over time
trip %>%
ggplot(aes(time_seconds, speed_meters_per_second)) +
geom_path(color = "red") +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank()),
## heading over time (clockwise due north)
trip %>%
ggplot(aes(time_seconds, heading_degrees)) +
geom_path(color = "green")
)
The sharp spikes at the end of the trip stand out. They arise because heading is circular: the difference between headings of 359 and 0 is 1 degree, not 359.
We will use the following formula when measuring the degree of turn, denoted D:
\[D = (h_t - h_{t-1} + 540) \pmod{360} - 180\]
This formulation avoids branching on the nominal headings and references each input value (current and previous heading) just once. Let's take first-order differences to get a feel for acceleration, and we will use the formula above to express the change in heading.
trip %<>%
mutate(accel = speed_meters_per_second - lag(speed_meters_per_second),
heading_diff = (heading_degrees - lag(heading_degrees) + 540) %% 360 - 180)
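A quick sanity check of the formula on a few hypothetical heading pairs confirms the wrap-around behaves as intended:

```r
## D = (h_t - h_{t-1} + 540) %% 360 - 180, the signed degree of turn
turn_deg <- function(h_prev, h_curr) (h_curr - h_prev + 540) %% 360 - 180

turn_deg(359, 0)   ## 1   (crossing north clockwise, not -359)
turn_deg(0, 359)   ## -1  (crossing north counter-clockwise)
turn_deg(10, 350)  ## -20
turn_deg(90, 180)  ## 90
```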
gridExtra::grid.arrange(
## acceleration (meters/sec) over time
trip %>%
ggplot(aes(time_seconds, accel)) +
geom_path(color = "orange") +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank()),
## turn over time (clockwise due north)
trip %>%
ggplot(aes(time_seconds, heading_diff)) +
geom_path(color = "blue")
)
Now we think about identifying turns. Intuitively, a turn should involve a sustained, substantial change in heading at moderate speed.
We will threshold a turn as having absolute heading magnitude change of 60 degrees or more. We will also exclude any instances where the average speed during the turn exceeded 20 m/s, or about 45 mph - the friction between road and tire is probably not sufficient to support turns at greater speed for most vehicles. Similarly, we will put an upper threshold on duration of the maneuver (45 seconds) so as not to misclassify long curves at lower speed.
trip %<>%
mutate(run = sign(heading_diff) * (abs(heading_diff) > .25)) %>%
group_by(runid = {runid = rle(run); rep(seq_along(runid$lengths), runid$lengths)}) %>%
mutate(tot_heading_diff = (last(heading_degrees) - first(heading_degrees) + 540) %% 360 - 180,
avg_speed = mean(speed_meters_per_second),
turn_ind = n() > 2 &
abs(tot_heading_diff) >= 60 &
avg_speed < 20 &
n() < 45) %>%
ungroup() %>%
mutate(turn_amt = case_when(turn_ind ~ tot_heading_diff, TRUE ~ NA_real_))
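The run-id construction inside group_by() above is a common rle()/rep() idiom; a small standalone example makes its behavior explicit:

```r
## Each maximal run of equal values gets its own id, so grouping by
## runid groups consecutive observations sharing the same turn state.
run <- c(0, 0, 1, 1, 1, 0, -1, -1)
r <- rle(run)
runid <- rep(seq_along(r$lengths), r$lengths)
runid  ## 1 1 2 2 2 3 4 4
```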
gridExtra::grid.arrange(
## speed (meters/sec) over time
trip %>%
ggplot(aes(time_seconds, speed_meters_per_second)) +
geom_path(color = "red") +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank()),
## heading over time (clockwise due north)
trip %>%
ggplot(aes(time_seconds, heading_degrees)) +
geom_path(color = "green"),
trip %>%
ggplot(aes(time_seconds, heading_diff)) +
geom_path(color = "blue") +
geom_line(aes(time_seconds, turn_amt), color = "brown")
)
For this trip we observe 9 turns as we have defined them.
We will define a stop as an event where speed is approximately 0 m/s for at least 3 seconds, and which is preceded by deceleration: the minimum acceleration within the 15 observations immediately preceding the start of a potential stop must be below -0.5 m/s².
trip %<>%
mutate(roll_accel_15 = lag(zoo::rollapply(accel, 15, min, align = "right", fill = NA)),
stop_run = abs(speed_meters_per_second) < .05) %>%
group_by(stop_runid = {stop_runid = rle(stop_run); rep(seq_along(stop_runid$lengths), stop_runid$lengths)}) %>%
mutate(stop_ind = n() > 2 & max(stop_run) & first(roll_accel_15) < -0.5)
gridExtra::grid.arrange(
## speed (meters/sec) over time
trip %>%
ggplot(aes(time_seconds, speed_meters_per_second, color = lead(stop_ind))) +
geom_path(aes(group = 1), show.legend = FALSE) +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank() ),
## acceleration
trip %>%
ggplot(aes(time_seconds, accel, color = lead(stop_ind))) +
geom_path(aes(group = 1), show.legend = FALSE) +
scale_color_manual(values=c("orange", "purple"))
)
The single period in which the vehicle is stopped is brief and is indicated by different colors in the series.
We now define several functions to prep data from a trip file and extract the features according to the logic above.
The prep_trip() function accepts the filename as a string and the partition/sub-dir in which the file appears, defaulted to "train." It returns a data.frame which is augmented with new columns and is taken as input by the other functions below.
prep_trip <- function(trip_file, partition = "train") {
trip <- read.csv(paste0(partition, "/", trip_file))
n <- nrow(trip)
if(n == 0 | ncol(trip) != 3 | ## no rows or wrong num of cols
sum(is.na(trip[,2])) == n | sum(is.na(trip[,3])) == n | ## all missing data either col
var(trip[, 2], na.rm = T) == 0 | var(trip[,3], na.rm = T) == 0) { ## zero variance column
bad_trip <<- trip ## return bad trip data to parent environment for a look-see
stop(paste0("bad trip here: ", trip_file,
"\n Speed NA: ", sum(is.na(trip[,2])),
"\n Heading NA: ", sum(is.na(trip[,3])),
"\n Speed variance: ", var(trip[, 2], na.rm = T),
"\n Heading variance: ", var(trip[, 3], na.rm = T) ))
} ## early exit
trip %>%
imputeTS::na_locf() %>%
## turns
mutate(heading_diff = (heading_degrees - lag(heading_degrees) + 540) %% 360 - 180,
run = sign(heading_diff) * (abs(heading_diff) > .25)) %>%
group_by(runid = {runid = rle(run); rep(seq_along(runid$lengths), runid$lengths)}) %>%
mutate(tot_heading_diff = (last(heading_degrees) - first(heading_degrees) + 540) %% 360 - 180,
avg_speed = mean(speed_meters_per_second),
turn_ind = n() > 2 &
abs(tot_heading_diff) >= 60 &
avg_speed < 20 &
n() < 45) %>%
ungroup() %>%
mutate(turn_amt = case_when(turn_ind ~ tot_heading_diff, TRUE ~ NA_real_)) %>%
## stops
mutate(accel = speed_meters_per_second - lag(speed_meters_per_second),
roll_accel_15 = lag(zoo::rollapply(accel, 15, min, align = "right", fill = NA)),
stop_run = abs(speed_meters_per_second) < .05) %>%
group_by(stop_runid = {stop_runid = rle(stop_run); rep(seq_along(stop_runid$lengths), stop_runid$lengths)}) %>%
mutate(stop_ind = n() > 2 & max(stop_run) & first(roll_accel_15) < -.5) %>%
ungroup() %>%
return()
}
This function takes as input the result from the prep_trip() function, and returns a length-1 vector with the count of stops in the trip.
This function takes as input the result from the prep_trip() function, and returns a length-1 vector with the count of turns in the trip.
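The count_stops() and count_turns() definitions are not echoed above; a minimal base-R sketch consistent with the turn_ind/runid and stop_ind/stop_runid columns produced by prep_trip() would look like the following (the counting logic here is an assumption, not the original code):

```r
## Hypothetical implementations: count distinct flagged runs.
count_turns <- function(trip_df) {
  length(unique(trip_df$runid[which(trip_df$turn_ind)]))
}
count_stops <- function(trip_df) {
  length(unique(trip_df$stop_runid[which(trip_df$stop_ind)]))
}
```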
This function will be used for a visual check of the turn & stop logic - it applies the prep_trip() function and plots the results.
plot_trip <- function(trip_file, return_df = FALSE) {
trip <- prep_trip(trip_file)
if(return_df) return(trip) ## break
p <- arrangeGrob(
## heading over time (clockwise due north)
trip %>%
ggplot(aes(time_seconds, heading_degrees, color = turn_ind)) +
geom_path(aes(group = 1), show.legend = FALSE) +
ggtitle(trip_file) +
scale_color_manual(values=c("green", "purple")) +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank() ),
## Change in heading
trip %>%
ggplot(aes(time_seconds, heading_diff)) +
geom_path(color = "blue") +
geom_line(aes(time_seconds, turn_amt), color = "brown") +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank() ),
## speed (meters/sec) over time
trip %>%
ggplot(aes(time_seconds, speed_meters_per_second, color = lead(stop_ind))) +
geom_path(aes(group = 1), show.legend = FALSE) +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank() ),
## acceleration/stops
trip %>%
ggplot(aes(time_seconds, accel, color = lead(stop_ind))) +
geom_path(aes(group = 1), show.legend = FALSE) +
scale_color_manual(values=c("orange", "blue")),
ncol = 1
)
return(p)
}
I'll take a sample of trips, 5 from each class, and apply the plotting function defined above to visually check for reasonable results.
set.seed(42)
small_samp <- train_raw %>%
select(filename, y_fctr) %>%
group_by(y_fctr) %>%
sample_n(5)
plist <- lapply(small_samp %>% pull(filename), plot_trip ) ## apply plot_trip() over sample
grid.arrange(
do.call(arrangeGrob, list(grobs = plist[1:5],
top = textGrob("0-class", gp=gpar(fontsize=15, font=3, col="red")),
ncol = 1, as.table = F) ),
do.call(arrangeGrob, list(grobs = plist[6:10],
top = textGrob("1-class", gp=gpar(fontsize=15, font=3, col="red")),
ncol = 1, as.table = F) ),
ncol = 2
)
The results seem reasonable, though not perfect. For example, in 0973.csv there appears to be a stop which is missed because of a noisy speed reading - possibly measurement error in the sensor, or a driver who likes to play with the brake pedal at lights.
One avenue for improvement will be data smoothing - e.g. via smoothing splines, simple kernels/filters, kriging, etc. - as part of a pre-processing step.
Another avenue includes techniques based on the matrix profile - specifically, shapelet and motif discovery.1 These methods involve discovering subsequences within a series which are similar to each other and/or maximally separating of classes.
For a single trip, we must first call prep_trip() which reads the relevant file from disk and returns a data.frame object. Then we call each of count_turns() and count_stops() which takes the data.frame as input and returns a length 1 vector for each. The computational time for 10 sequential iterations is shown below:
comp_eval <- function(trip_file) {
  a <- prep_trip(trip_file)
  b <- count_turns(a)
  c <- count_stops(a)
  invisible(NULL) ## timing harness only; results are discarded
}
v <- small_samp %>% pull(filename)
print(paste0("Test performed for ", length(v), " files sequentially"))
## [1] "Test performed for 10 files sequentially"
## user system elapsed
## 0.527 0.013 0.541
For 10 trips elapsed (clock) time is about half a second, nearly all of it user-mode CPU time; system time, which covers kernel work such as disk I/O, is a marginal fraction.
The process will scale over multiple files with ~O(n) (linear time) complexity.
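Since trips are independent files, the per-trip work also parallelizes trivially; a sketch of the pattern with the base parallel package (the worker below is a placeholder; in the real pipeline it would call prep_trip() and the count_*() helpers):

```r
library(parallel)

## Placeholder per-file worker standing in for prep_trip()/count_*()
worker <- function(f) nchar(f)

files <- sprintf("%04d.csv", 0:9)
## mclapply is fork-based and therefore Unix-only
res <- unlist(mclapply(files, worker, mc.cores = 2))
res  ## 8 for every file ("0000.csv" has 8 characters)
```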
We turn now to the modeling analysis, right after we prepare our training data
First let’s prepare our features for the training set
train_newx <- tibble(filename = NA_character_, stop_count = NA_integer_, turn_count = NA_integer_) %>% slice(-1)
for(i in 1:nrow(train_raw)) {
f <- train_raw[i,] %>% pull(filename)
trip_result <- tryCatch({
trip <- prep_trip(f)
},
error = function(cond) {
message(paste0("prep_trip() failed for file ", f, " with following error: \n", cond))
return(cond)
} )
if(inherits(trip_result, "error")) {
train_newx %<>% add_row(filename = f, stop_count = NA_integer_, turn_count = NA_integer_)
next
}
train_newx %<>%
add_row(filename = f,
stop_count = count_stops(trip),
turn_count = count_turns(trip) )
rm(trip, f)
}
## prep_trip() failed for file 0195.csv with following error:
## Error in prep_trip(f): bad trip here: 0195.csv
## Speed NA: 0
## Heading NA: 0
## Speed variance: 0.365424790314973
## Heading variance: 0
Bind the two new features, stop_count and turn_count, to the training data. The rows remain in their original order, so no join is needed.
Trip 0195 failed due to zero variance in the heading vector. Counts will be NA for this record.
We will impute the missing values in order to use all observations in modeling.
lhs <- c("turn_count", "stop_count")
rhs <- colnames(train)[!(colnames(train) %in% c(lhs, c("filename", "y_fctr")))]
imp_formula <- paste(paste(lhs, collapse = " + "), paste(rhs, collapse=" + "), sep=" ~ ") %>% as.formula()
imp_formula
## turn_count + stop_count ~ feature1 + feature3 + feature4 + feature6 +
## feature7 + feature9 + feature12 + feature13 + feature14
## <environment: 0x153953e0>
train %<>% missRanger(imp_formula,
pmm.k = 5,
num.trees = 500,
sample.fraction = .6,
max.depth = 1,
data = .,
maxiter = 10L,
returnOOB = TRUE,
seed = 123,
verbose = 2)
##
## Missing value imputation by random forests
##
## Variables to impute: turn_count, stop_count
## Variables used to impute: feature1, feature3, feature4, feature6, feature7, feature9, feature12, feature13, feature14
##         turn_count stop_count
## iter 1: 1.0021 1.0014
We will use the random forest algorithm as implemented in the ranger library. We prefer a simple and computationally fast approach - random forest provides a strong off-the-shelf classifier with few hyperparameters requiring tuning (as compared to say gradient boosting or neural networks) and is extremely parallelizable.
The two main hyperparameters we will tune with repeated cross-validation are: the number of features to randomly select for each potential split (mtry); and the minimum number of observations required in each node (min.node.size). The latter controls model complexity and overfitting - we will not tune or use max.depth which also has the effect of reducing complexity of individual learners.
We will also evaluate two possible split rules - the traditional Gini impurity, as well as the more recent Hellinger distance. The former favors splits which produce pure (class-imbalanced) leaf nodes. Hellinger distance instead measures the separation between the class-conditional distributions induced by a split and is insensitive to class skew; it generally yields better (higher) sensitivity, with minimal loss in specificity.2
Setting up our grid and controls for tuning
k <- 10 ## folds
r <- 10 ## repeats
num.trees <- 2000 ## rf does not overfit by num.trees, but through complexity of individual learners
rfGrid <- expand.grid(mtry = 1:4,
min.node.size = seq(15, 30, 5),
splitrule = c("gini", "hellinger") )
seeds <- hsbtools::make_seeds(grid = rfGrid,
k = k,
r = r,
alpha.seed = 4242)
trnCtrl <- caret::trainControl(method = "repeatedcv",
number = k,
repeats = r,
summaryFunction = twoClassSummary,
seeds = seeds,
returnData = FALSE,
classProbs = TRUE,
savePredictions = FALSE)
Run the learning procedure.
rf.cv <- caret::train(y_fctr ~ .,
method = "ranger",
verbose = F,
tuneGrid = rfGrid,
trControl = trnCtrl,
num.trees = num.trees,
importance = "impurity", ## feature importance basis
metric = "ROC", ## poorly named, AUC really
data = train[, -1], ## omit filename
na.action = na.fail) ## no NAs, or else do not move past them silently
View the estimated AUC across our hyperparameter grid.
We will select the best result from the above set of evaluations - hellinger distance split rule with mtry = 2 and min.node.size = 15
Now train the model using the optimal hyperparameters. We will plot feature importance based on gini impurity. Note that our model returns class probabilities rather than class labels - we prefer this so we can choose the threshold for labeling later, which may not be 0.5 if we have unequal penalties from each type of misclassification (false negatives and false positives).
rf.1 <- ranger(y_fctr ~ .,
num.trees = num.trees,
mtry = rf.cv$bestTune$mtry,
min.node.size = rf.cv$bestTune$min.node.size,
importance = "impurity",
probability = TRUE,
splitrule = rf.cv$bestTune$splitrule,
data = train[,-1],
seed = 42)
varImp <- importance(rf.1) %>% .[order(desc(.))]
plot(varImp, xaxt="n", xlab="", ylab = "Gini importance")
axis(1, at = 1:length(varImp), labels = names(varImp), las = 2)
feature7 provides the highest predictive contribution; feature1 and feature3 provide the least information. We also note that our bespoke features extracted from the trip data contain some signal of our response, though they are also not the strongest predictors. We will look closer at the nature of these contributions in the next section.
To provide fair estimates of predictive power we want to form predictions from our model on new (unseen) data. One nice feature of random forests is that they come with a built-in hold-out sample at each iteration in the form of the out-of-bag (OOB) samples. Ranger has already created out-of-bag predictions for us which we will use to calculate a confusion matrix and the more generalized AUC.
oob_pred <- tibble(y_fctr = train$y_fctr,
p = rf.1$predictions[, 2]) %>% ## out of bag predictions
mutate(predicted_label = case_when(p >= .5 ~ "Class_1", TRUE ~ "Class_0") %>% as.factor)
We assume equal cost of misclassification for each class by setting our probability threshold at 0.50 for label prediction. If we knew that false positives were more "expensive" for us than false negatives or vice versa, then this threshold should be adjusted accordingly to minimize the total cost incurred from misclassification in production.
## Confusion Matrix and Statistics
##
## Reference
## Prediction Class_0 Class_1
## Class_0 648 102
## Class_1 42 208
##
## Accuracy : 0.856
## 95% CI : (0.8327, 0.8772)
## No Information Rate : 0.69
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6444
##
## Mcnemar's Test P-Value : 8.803e-07
##
## Sensitivity : 0.6710
## Specificity : 0.9391
## Pos Pred Value : 0.8320
## Neg Pred Value : 0.8640
## Prevalence : 0.3100
## Detection Rate : 0.2080
## Detection Prevalence : 0.2500
## Balanced Accuracy : 0.8050
##
## 'Positive' Class : Class_1
##
The specificity (= 1 - false positive rate) of our classifier is ~.94 - we do a good job of avoiding “false alarms” of mislabeling cases as positive. The sensitivity (= 1 - false negative rate, aka recall) of ~.67 tells us how good we are at detecting positive cases. Precision - the proportion of positive-labeled cases which are truly positive - of .83 tells us that most of our positive predictions are accurate. By setting the decision threshold at .50, we have sought to maximize overall accuracy, estimated as ~86%.
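These headline numbers can be re-derived directly from the confusion-matrix counts above:

```r
## Counts from the confusion matrix (positive class = Class_1)
tn <- 648; fn <- 102; fp <- 42; tp <- 208

sensitivity <- tp / (tp + fn)                  ## 208/310  ~ 0.671
specificity <- tn / (tn + fp)                  ## 648/690  ~ 0.939
precision   <- tp / (tp + fp)                  ## 208/250  = 0.832
accuracy    <- (tp + tn) / (tn + fn + fp + tp) ## 856/1000 = 0.856
```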
To get the decision matrix above, we had to choose a threshold (.50) to move from a predicted probability of an observation being in Class 1 to actually labeling the observation as a 1 or 0. The ROC curve shows the tradeoff between sensitivity and specificity as we vary the threshold between 0.0 and 1.0 for labeling a case positive. In a way then it is like a continuous confusion matrix.
pROC_obj <- pROC::roc(oob_pred$y_fctr, oob_pred$p,
ci=TRUE, ci.alpha=0.8, stratified=FALSE,
plot = TRUE, auc.polygon=TRUE, max.auc.polygon=TRUE,
grid=TRUE, print.auc=TRUE, show.thres=TRUE)
## Setting levels: control = Class_0, case = Class_1
## Setting direction: controls < cases
sens.ci <- pROC::ci.se(pROC_obj)
plot(sens.ci, type="shape", col="lightblue") ## low resolution shape may not fully contain ROC
plot(sens.ci, type="bars")
points(cm$byClass["Specificity"], cm$byClass["Sensitivity"], col = "red", lwd = 2)
The curve shows the tradeoff between the two types of misclassification, with the red circle indicating how we have calibrated with a .50 decision threshold. We can increase our sensitivity and correctly label more positive cases, but we will increase our mislabeling of the negative cases at the same time. Due to the class imbalance (negative cases are about 2/3 of the data), a 1-point reduction in specificity must be offset by at least a 2-point improvement in sensitivity, roughly speaking and still assuming equal cost of errors.
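The rough "2 points of sensitivity per point of specificity" figure follows directly from the class counts (690 negatives, 310 positives):

```r
n_neg <- 690; n_pos <- 310

## A 1-point (0.01) drop in specificity admits 6.9 extra false positives;
## holding accuracy constant then requires 6.9 extra true positives,
## i.e. about 2.2 points of sensitivity.
extra_fp <- 0.01 * n_neg              ## 6.9
sens_gain_needed <- extra_fp / n_pos  ## ~0.022
```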
We want to understand the nature of the relationship between our features and the response. Since there is just a handful of features, we can look at individual conditional expectation (ICE) plots for each. Like partial dependence plots, these show the average marginal effect of a feature, but in addition they show the marginal effect for individual observations in the training set. This helps us understand if there were important interactions between variables learned in our forest.
We are selecting 4 of the 11 features to plot below
ice <- vector("list", length = 4)
names(ice) <- names(varImp)[c(1,3,8,9)]
for (i in 1:length(ice)) {
icePartial <- partial(rf.1,
pred.var = names(ice)[i],
pred.fun = function(object, newdata) predict(object, newdata)$predictions[,2])
ice[[i]] <- plotPartial(icePartial, rug = TRUE, train = train, alpha = 0.1, ylab = "Predicted P[y=1]")
}
lapply(ice, plot)
## $feature7
## NULL
##
## $feature4
## NULL
##
## $turn_count
## NULL
##
## $stop_count
## NULL
We observe that turn_count interacts with another feature in our model - the marginal effect of moving above 3 turns appears to depend on the value of some other variable. If we dig further we can find out what feature this is and use the information to specify interactions say in a parametric model - work for a future analysis.
The marginal effect of stop_count appears to be weak and perhaps counter-intuitive in its direction. We should revisit the extraction logic for this feature, or perhaps consider other features which more completely describe the nuance of driving behavior - repeated stopping at low speed, hard stopping, rapid deceleration to a near stop, etc.
Creating the test dataset
test_raw <- read.csv("model_data_test.csv") %>%
as_tibble %>%
mutate(feature1 = feature1 %>% as.factor(),
feature2 = feature2 %>% as.factor(),
feature3 = feature3 %>% as.factor() ) %>%
select(-c(feature10, feature2, feature11, feature5, feature8))
test_newx <- tibble(filename = NA_character_, stop_count = NA_integer_, turn_count = NA_integer_) %>% slice(-1)
for(i in 1:nrow(test_raw)) {
f <- test_raw[i,] %>% pull(filename)
trip_result <- tryCatch({
trip <- prep_trip(f, "test")
},
error = function(cond) {
message(paste0("prep_trip() failed for file ", f, " with following error: "))
message(cond)
return(cond)
} )
if(inherits(trip_result, "error")) {
test_newx %<>% add_row(filename = f, stop_count = NA_integer_, turn_count = NA_integer_)
next
}
test_newx %<>%
add_row(filename = f,
stop_count = count_stops(trip),
turn_count = count_turns(trip) )
rm(trip, f)
}
test <- cbind(test_raw, test_newx[,-1])
## Check for NAs in features
apply(test, 2, function(x) sum(is.na(x)))
## filename feature1 feature3 feature4 feature6 feature7 feature9
## 0 0 0 0 0 0 0
## feature12 feature13 feature14 stop_count turn_count
## 0 0 0 0 0
Labeling with the final model. Note that we assume equal cost of misclassification for each class by setting our decision bound at 0.5. If we knew that false positives were more “expensive” to us than false negatives or vice versa, this threshold should be adjusted accordingly to minimize the cost incurred from misclassification.
test_preds <- tibble(filename = test$filename,
pr_class_1 = predict(rf.1, test)[[1]][,"Class_1"]) %>%
mutate(prediction = case_when(pr_class_1 >= .5 ~ 1,
TRUE ~ 0)) %>%
select(-pr_class_1)
head(test_preds)
As an aside, a radial-kernel SVM achieves better cross-validated AUC.
svmGrid <- expand.grid(C = seq(2, 40, 2),
sigma = seq(.01, .05, .005))
seeds.svm <- hsbtools::make_seeds(grid = svmGrid,
k = k,
r = r,
alpha.seed = 4242)
trnCtrl.svm <- caret::trainControl(method = "repeatedcv",
number = k,
repeats = r,
summaryFunction = twoClassSummary,
seeds = seeds.svm,
returnData = FALSE,
classProbs = TRUE,
savePredictions = FALSE)
svmfit <- caret::train(y_fctr ~ .,
                       tuneGrid = svmGrid,
                       trControl = trnCtrl.svm,
                       method = "svmRadial",
                       preProcess = c("center", "scale"),
                       metric = "ROC", ## poorly named, AUC really
                       data = train[, -1], ## omit filename
                       na.action = na.fail,
                       verbose = FALSE)
plot(svmfit)

1. Yeh, et al. "Time series joins, motifs, discords, and shapelets: a unifying view that exploits the matrix profile." 2017. https://www.cs.ucr.edu/~eamonn/MP_journal.pdf
2. Lyon, et al. "Hellinger Distance Trees for Imbalanced Streams." 2014. https://arxiv.org/pdf/1405.2278.pdf